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1.You were asked to setup a new all-purpose cluster, but the cluster is unable to start which of the 
following steps do you need to take to identify the root cause of the issue and the reason why the cluster 
was unable to start? 

A. Check the cluster driver logs 

B. Check the cluster event logs 

(Correct) 

C. Workspace logs 

D. Storage account 

E. Data plane 

Answer: B 

Explanation: 

Cluster event logs are very useful, to identify issues pertaining to cluster availability. Cluster may not start 
due to resource limitations or issues with the cloud providers. 

Some of the common issues include a subnet for compute VM reaching its limits or exceeding the 
subscription or cloud account CPU quota limit. 

Here is an example where the cluster did not start due to subscription reaching the quota limit on a certain 
type of cpu cores for a VM type. 


ose Cluster Ul Preview Provide feedback 


Test All Purpose Cluster 


Configuration Notet h { ent ark rive i Met seh r Masters 


© 


Graphical user 

interface, text, application, email 
Description automatically generated 
Click on event logs 


Test All Purpose Cluster > | sae | 
TERA Pa eg GIS MET (25490 GOT Ow MAS Pease aret GRA (Re RT Pastor 

Graphical user 

interface, text, application, email 


Description automatically generated 
Click on the message to see the detailed error message on why the cluster did not start. 
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Cluster terminated. Reason: Azure Quota Exceeded Exception 


Graphical user 
interface, text, application, email 
Description automatically generated 


2.A SQL Dashboard was built for the supply chain team to monitor the inventory and product orders, but 
all of the timestamps displayed on the dashboards are showing in UTC format, so they requested to 
change the time zone to the location of New York. 

How would you approach resolving this issue? 

A. Move the workspace from Central US zone to East US Zone 

B. Change the timestamp on the delta tables to America/New York format 

C. Change the spark configuration of SQL endpoint to format the timestamp to Ameri-ca/New_York 
D. Under SQL Admin Console, set the SQL configuration parameter time zone to Ameri-ca/New_York 
E. Add SET Timezone = America/New_ York on every of the SQL queries in the dashboard. 

Answer: D 

Explanation: 

The answer is, Under SQL Admin Console, set the SQL configuration parameter time zone to 
America/New_York 

Here are steps you can take this to configure, so the entire dashboard is changed without 

changing individual queries 

Configure SQL parameters 

To configure all warehouses with SQL parameters: 

1.Click Settings at the bottom of the sidebar and select SQL Admin Console. 

2.Click the SQL Warehouse Settings tab. 

3.In the SQL Configuration Parameters textbox, specify one key-value pair per line. Sepa-rate the name 
of the parameter from its value using a space. For example, to ena-ble ANSI_ MODE: 
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SQL Configuration Parameters 


SQL Configuration Parameters let you override the default behavior for all sessions with all endpoints. Session parameters can be overridden for a single session with the 


SET command 


@ Warning: When you save a change to the SQL configuration parameters, clusters allocated to running SQL endpoints are restarted 


SQL Configuration Parameters ANST_ MODE true 


Graphical user 

interface, text, application 

Description automatically generated 

Similarly, we can add a line in the SQL Configuration parameters timezone America/New_York 
SQL configuration parameters | Databricks on AWS 


3.You are currently asked to work on building a data pipeline, you have noticed that you are currently 
working on a very large scale ETL many data dependencies, which of the following tools can be used to 
address this problem? 

A. AUTO LOADER 

B. JOBS and TASKS 

C. SQL Endpoints 

D. DELTALIVE TABLES 

E. STRUCTURED STREAMING with MULTI HOP 

Answer: D 

Explanation: 

The answer is, DELTA LIVE TABLES 

DLT simplifies data dependencies by building DAG-based joins between live tables. Here is a view of how 
the dag looks with data dependencies without additional meta data, 

1.create or replace live view customers 

2.select * from customers; 

3. 

4.create or replace live view sales_orders_raw 

5.select * from sales orders; 

6. 

7.create or replace live view sales orders cleaned 

8.as 

9.select sales.* from 

10.live.sales orders raw s 

11. join live.customers c 

12.0n c.customer_id = s.customer id 
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13.where c.city = 'LA’; 

14. 

15.create or replace live table sales orders in la 
16.selects from sales orders cleaned; 

Above code creates below dag 


= customers 
Completed - 3s 


sales_orders_clea... © ™ sales order in la © 


'mpleted + 2 Completed - 3 


= sales_orders_raw 
( npleted - 28 


Documentation on DELTA LIVE TABLES, 
https://databricks.com/product/delta-live-tables 
https://databricks.com/blog/2022/04/05/announcing-generally-availability-of-databricks-delta-live-tables-d 
It.html 

DELTA LIVE TABLES, addresses below challenges when building ETL processes 
. Complexities of large scale ETL 

. Hard to build and maintain dependencies 

. Difficult to switch between batch and stream 

. Data quality and governance 

. Difficult to monitor and enforce data quality 

. Impossible to trace data lineage 

. Difficult pipeline operations 

. Poor observability at granular data level 

. Error handling and recovery is laborious 


TO wWwWwoeoenaon > 


4.When you drop a managed table using SQL syntax DROP TABLE table_name how does it impact 
metadata, history, and data stored in the table? 

A. Drops table from meta store, drops metadata, history, and data in storage. 

B. Drops table from meta store and data from storage but keeps metadata and history in storage 

C. Drops table from meta store, meta data and history but keeps the data in storage 

D. Drops table but keeps meta data, history and data in storage 

E. Drops table and history but keeps meta data and data in storage 

Answer: A 

Explanation: 
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For a managed table, a drop command will drop everything from metastore and storage. 


See the below image to understand the differences between dropping an external table. 
External DELTA Table 


Control Pane Data plane 
{(Databricks) (Customer - Cloud object storage) 


External DELTA Table | 


f 
Managed DELTA Table N j j 
DROP TABLE test 


Diagram 
Description automatically generated 


5.Which of the following approaches can the data engineer use to obtain a version-controllable 
con-figuration of the Job’s schedule and configuration? 

A. They can link the Job to notebooks that are a part of a Databricks Repo. 

B. They can submit the Job once on a Job cluster. 

C. They can download the JSON equivalent of the job from the Job’s page. 

D. They can submit the Job once on an all-purpose cluster. 

E. They can download the XML description of the Job from the Job’s page 

Answer: D 


6.What is the underlying technology that makes the Auto Loader work? 
A. Loader 

B. Delta Live Tables 

C. Structured Streaming 

D. DataFrames 

E. Live DataFames 

Answer: C 


7.You are currently looking at a table that contains data from an e-commerce platform, each row contains 
a list of items(Item number) that were present in the cart, when the customer makes a change to the cart 
the entire information is saved as a separate list and appended to an existing list for the duration of the 
customer session, to identify all the items customer bought you have to make a unique list of items, you 
were asked to create a unique item’s list that was added to the cart by the user, fill in the blanks of below 
query by choosing the appropriate higher-order function? 
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Note: See below sample data and expected output. 
Schema: cartld INT, items Array<INT> 


Sample data: 


card ————“(i‘“‘C:*sS™SCCs~idiitems O 
PD ff 100,200,300], [1,250,300 
ae : 


10,150,200, 300 d 


Expected output] 


i o oooO ft, 100,200,300,250 
2] 10, 150,200,300,210,350 


Fill in the blanks: 

Fill in the blanks: 

SELECT cartld, _(_(items)) FROM carts 

A. ARRAY_UNION, ARRAY_DISCINT 

B. ARRAY_DISTINCT, ARRAY_UNION 

C. ARRAY_DISTINCT, FLATTEN 

D. FLATTEN, ARRAY_DISTINCT 

E. ARRAY_DISTINCT, ARRAY_FLATTEN 

Answer: C 

Explanation: 

FLATTEN -> Transforms an array of arrays into a single array. 
ARRAY DISTINCT -> The function returns an array of the same type as the input argument where all 
duplicate values have been removed. 


Input FLATTEN(items) ARRAY_DISTINCT(FLATTEN(items)) 
cartId | items [card | items cartId items 
1 [[1.100,200.300], (1,250,300]] —> |! [1,100,200,300,1,250,300] — 1 [1,100,200,300,250] 
2 [[10,150,200,300], [1,210,300],[350]] [2 [10,150,200,300,1,210,300,350] 2 [10,150,200,300,210,350] 


Table 
Description automatically generated 


8.When building a DLT s pipeline you have two options to create a live tables, what is the main difference 
between CREATE STREAMING LIVE TABLE vs CREATE LIVE TABLE? 

A. CREATE STREAMING LIVE table is used in MULTI HOP Architecture 

B. CREATE LIVE TABLE is used when working with Streaming data sources and Incremental data 

C. CREATE STREAMING LIVE TABLE is used when working with Streaming data sources and 
Incremental data 

D. There is no difference both are the same, CREATE STRAMING LIVE will be deprecated soon 

E. CREATE LIVE TABLE is used in DELTA LIVE TABLES, CREATE STREAMING LIVE can only used in 
Structured Streaming applications 

Answer: C 

Explanation: 

The answer is, CREATE STREAMING LIVE TABLE is used when working with Streaming data sources 
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and Incremental data 


9.You are tasked to set up a set notebook as a job for six departments and each department can run the 
task parallelly, the notebook takes an input parameter dept number to process the data by department, 
how do you go about to setup this up in job? 

A. Use a single notebook as task in the job and use dbutils.notebook.run to run each note-book with 
parameter in a different cell 

B. A task in the job cannot take an input parameter, create six notebooks with hardcoded dept number 
and setup six tasks with linear dependency in the job 

C. Atask accepts key-value pair parameters, creates six tasks pass department number as parameter 
foreach task with no dependency in the job as they can all run in parallel. (Correct) 

D. A parameter can only be passed at the job level, create six jobs pass department number to each job 
with linear job dependency 

E. A parameter can only be passed at the job level, create six jobs pass department number to each job 
with no job dependency 

Answer: C 

Explanation: 

Here is how you setup 

Create a single job and six tasks with the same notebook and assign a different parameter for each task, 
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Task name * @ 


sales 


Type * Source * @ 


Notebook 


Path * @ 
/Rep0s i |i-clemo/notebooks/job-demo B wa) 


You are using a Local repository. To automatically track the upstream repository in staging or 
production Jobs, select Git in the Source field. 


Cluster * @ 


Process all departments cluster (28.00 GB | 8 Cores | DBR 10.4 LTS | Spark 3.2.1 | S... # 


Parameters @ UI | JSON 
{ 


"department": "accounting" 


Depends on 


Select task dependencies... 


v Advanced options 


Graphical user interface, text, application, email 

Description automatically generated 

All tasks are added in a single job and can run parallel either using single shared cluster or with individual 
clusters. 


accounting distribution finance marketing operations sales 


O _mcii.demo/notebocksjob.cem: D _micidemo/notebooks/job-deme DI. miel-demo/notebookafob-dem D — micidemainotebook 


Graphical user 
interface, application, Teams 
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Description automatically generated 


10.Which of the following commands can be used to run one notebook from another notebook? 
A. notebook.utils.run("full notebook path") 

B. execute.utils.run("full notebook path") 

C. dbutils.notebook.run("full notebook path") 

D. only job clusters can run notebook 

E. spark.notebook.run("full notebook path") 

Answer: C 

Explanation: 

The answer is dbutils.notebook.run(" full notebook path ") 

Here is the full command with additional options. 

run(path: String, timeout_seconds: int, arguments: Map): String 
1.dbutils.notebook.run("ful-notebook-name", 60, {"argument": "data", "argument2": "data2", 


yy 


11.You have configured AUTO LOADER to process incoming IOT data from cloud object storage every 15 
mins, recently a change was made to the notebook code to update the processing logic but the team later 
realized that the notebook was failing for the last 24 hours, what steps team needs to take to reprocess 
the data that was not loaded after the notebook was corrected? 

A. Move the files that were not processed to another location and manually copy the files into the 
ingestion path to reprocess them 

B. Enable back_fill = TRUE to reprocess the data 

C. Delete the checkpoint folder and run the autoloader again 

D. Autoloader automatically re-processes data that was not loaded 

E. Manually re-load the data 

Answer: D 

Explanation: 

The answer is, 

Autoloader automatically re-processes data that was not loaded using the checkpoint. 


12.John Smith is a newly joined team member in the Marketing team who currently has access read 
access to sales tables but does not have access to delete rows from the table, which of the following 
commands help you accomplish this? 

A. GRANT USAGE ON TABLE table_name TO john.smith@marketing.com 

B. GRANT DELETE ON TABLE table_name TO john.smith@marketing.com 

C. GRANT DELETE TO TABLE table_name ON john.smith@marketing.com 

D. GRANT MODIFY TO TABLE table_name ON john.smith@marketing.com 

E. GRANT MODIFY ON TABLE table_name TO john.smith@marketing.com 

Answer: E 

Explanation: 

The answer is GRANT MODIFY ON TABLE table_name TO john.smith@marketing.com, please note 
INSERT, UPDATE, and DELETE are combined into one role called MODIFY. Below are the list of 
privileges that can be granted to a user or a group, SELECT: gives read access to an object. 
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CREATE: gives the ability to create an object (for example, a table in a schema). 

MODIFY: gives the ability to add, delete, and modify data to or from an object. 

USAGE: does not give any abilities, but is an additional requirement to perform any action on a schema 
object. 

READ_METADATA: gives the ability to view an object and its metadata. 

CREATE NAMED FUNCTION: gives the ability to create a named UDF in an existing catalog or schema. 
MODIFY_CLASSPATH: gives the ability to add files to the Spark classpath. 

ALL PRIVILEGES: gives all privileges (is translated into all the above privileges 


13.Which of the following SQL command can be used to insert or update or delete rows based on a 
condition to check if a row(s) exists? 

A. MERGE INTO table_name 

B. COPY INTO table_name 

C. UPDATE table_name 

D. INSERT INTO OVERWRITE table_name 

E. INSERT IF EXISTS table_name 

Answer: A 

Explanation: 

here is the additional documentation for your review. 
https://docs.databricks.com/spark/latest/spark-sql/language-manual/delta-merge-into.html 
1.MERGE INTO target_table_name [target_alias] 

2. USING source table reference [source alias] 

3. ON merge condition 

4. [WHEN MATCHED [AND condition ] THEN matched_action ] [...] 

5. [WHEN NOT MATCHED [AND condition ] THEN not_matched_action ] [...] 
6. 

7.matched_action 

8. { DELETE | 

9. UPDATE SET * | 

10. UPDATE SET { column1 = value’ } [, ...] } 

11. 

12.not_matched_action 

13. { INSERT * | 

14. INSERT (column1 [, ...] ) VALUES (value? [, ...]) 


14.What is the output of the below function when executed with input parameters 1, 3: 
1.def check_input(x,y): 

2. if x< y: 

. X= X+1 

. if x<y: 

X= x+1 

if x <y: 

X = x+1 

. return x 


ONDARY 
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check_input(1,3) 
A. 1 

B. 2 

C. 3 (Correct) 

D. 4 

E.5 

Answer: C 


15.Which of the following is true, when building a Databricks SQL dashboard? 

A. A dashboard can only use results from one query 

B. Only one visualization can be developed with one query result 

C. A dashboard can only connect to one schema/Database 

D. More than one visualization can be developed using a single query result 

E. Adashboard can only have one refresh schedule 

Answer: D 

Explanation: 

the answer is, More than one visualization can be developed using a single query result. In the query 
editor pane + Add visualization tab can be used for many visualizations for a single query result. 
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Graphical user 
interface, text, application 
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